Description

[0001] The present invention relates to the compilation and execution of source code for a processor architecture consisting of a primary processor and one (or more) secondary processors. The invention is particularly, though not exclusively, relevant to architectures employing a reconfigurable [...]
[0002] A primary processor - such as a Pentium processor [...] Corporation) - has evolved to be versatile, in that it is adapted to handle a wide range of computational tasks without being optimised for any of them. Such a processor is thus not optimised to handle efficiently computationally intensive operations, such as parallel sub-word tasks. Such tasks can cause significant bottlenecks in the execution of code.

[0003] An approach taken to solve this problem is the development of integrated circuits specifically adapted for particular applications. These are known as ASICs, or application-specific integrated circuits. Tasks for which such an ASIC is adapted are generally performed very well: however, the ASIC will generally perform poorly, if at all, on tasks for which it is not configured. Clearly, a specific IC can be built for a particular application, but this is not a desirable solution for applications that are not central to the operation of a computer, or are not yet determined at the time of building the computer. It is thus particularly advantageous for an ASIC to be reconfigurable, so that it can be optimised for different applications as required. The commonest form of architecture for such devices is the field programmable gate array (FPGA), a fine-grained processor structure which can be configured to have a structure which is suited to any given application. Such structures can be used as independent processors in suitable contexts, but are also particularly appropriate to use as coprocessors.

[0004] Such configurable coprocessors have the potential to improve the performance of the primary processor. For particular tasks, code run inefficiently by the primary processor can be extracted and run more efficiently in an adapted coprocessor which has been optimised for that application. With continued development of such "application-specific" secondary processors, the possibility of improving performance by extracting difficult code to a custom coprocessor becomes more attractive. A particularly important example in general computing is the extraction of loop bodies in image handling.

[0005] To obtain the desired efficiency gains, it is necessary to determine as effectively as possible how code is to be divided between primary and secondary processors, and to configure the secondary processor for optimal execution of its assigned part of the code. One approach is to mark the code appropriately on its creation for mapping to coprocessor structures. In "A C++ compiler for FPGA custom execution units synthesis", Christian Iseli and Eduardo Sanchez, IEEE Symposium on FPGAs for Custom Computing Machines, Napa, California, April 1995, an approach is employed which involves mapping of C++ to FPGAs in VLIW (Very Long Instruction Word) structures after appropriate tagging of the initial code by the programmer. This approach relies on the initial programmer making a good choice of code to extract initially.

[0006] An alternative approach is to assess the initial code to determine which will be the most appropriate elements to direct to the secondary processor. "Two-Level Hardware/Software Partitioning Using CoDe-X", Reiner W. Hartenstein, Jürgen Becker and Rainer Kress, in Int. IEEE Symposium on Engineering of Computer Based Systems (ECBS), Friedrichshafen, Germany, March 1996, discusses a codesign tool which incorporates a profiler to assess which parts of an initial code are suitable for allocation to a coprocessor and which should be reserved for the primary processor. This is followed by an iterative procedure allowing for compilation of a subset of C code to a reconfigurable coprocessor architecture so that the extracted code can be mapped to the coprocessor. This approach does expand the usage of secondary processors, but does not fully realize the potential of reconfigurable logic.

[0007] Comparable approaches have been proposed in the BRASS research project at the University of Berkeley. An approach discussed in "Datapath-Oriented FPGA Mapping and Placement", Tim Callahan & John Wawrzynek, a poster presented at FCCM'97, Symposium on Field-Programmable Custom Computing Machines, April 16-18 1997, Napa Valley, California (currently available on the World Wide Web at http://www.cs.berkeley.edu/projects/brass/tjc_fccm_poster_thumb.ps), uses template structures representative of an FPGA architecture to assist in the mapping of source code on to FPGA structures. Source code samples are rendered as directed acyclic graphs, or DAGs, and then reduced to trees. These and other basic graph concepts are set out, for example, in "High Performance Compilers for Parallel Computing", Michael Wolfe, pages 49 to 56, Addison-Wesley, Redwood City, 1996, but a brief definition of a DAG and a tree follows here.

[0008] A graph consists of a set of nodes, and a set of edges: each edge is defined by a pair of nodes (and can be considered graphically as a line joining those nodes). A graph can be either directed or undirected: in a directed graph, each edge has a direction. If it is possible to define a path within a graph from one node back to itself, then the graph is cyclic: if not, then the graph is acyclic. A DAG is a graph that is both directed and acyclic: it is thus a hierarchical structure. A tree is a specific kind of DAG. A tree has a single source node, termed "root", and there is a unique path from root to every other node in the tree. If there is an edge X→Y in a tree, then node X is termed the parent of Y, and Y is termed the child of X. In a tree, a "parent node" has one or more "child nodes", but a child node can have only one parent, whereas in a general DAG, a child can have more than one parent. Nodes of a tree with no children are termed leaf nodes.
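By way of illustration only, the definitions above can be expressed as a short Python sketch. The function names and the example graphs are assumptions introduced here and form no part of the cited references; edges are written as (parent, child) pairs.

```python
from collections import defaultdict

def successors(edges):
    """Build an adjacency map {node: [child, ...]} from (parent, child) pairs."""
    adj = defaultdict(list)
    for parent, child in edges:
        adj[parent].append(child)
    return adj

def is_acyclic(nodes, edges):
    """Return True if no directed path leads from a node back to itself."""
    adj = successors(edges)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on current path, 2 = finished

    def visit(n):
        if state[n] == 1:
            return False           # met a node already on the current path: a cycle
        if state[n] == 2:
            return True
        state[n] = 1
        ok = all(visit(c) for c in adj[n])
        state[n] = 2
        return ok

    return all(visit(n) for n in nodes)

def is_tree(nodes, edges):
    """Acyclic, a single root, and exactly one parent for every other node."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    roots = [n for n in nodes if not parents[n]]
    one_parent_each = all(len(parents[n]) == 1 for n in nodes if n not in roots)
    return is_acyclic(nodes, edges) and len(roots) == 1 and one_parent_each

def leaf_nodes(nodes, edges):
    """Nodes with no children."""
    adj = successors(edges)
    return [n for n in nodes if not adj[n]]

# Hypothetical example graphs over nodes X, Y, Z, W.
NODES = ["X", "Y", "Z", "W"]
TREE_EDGES = [("X", "Y"), ("X", "Z"), ("Z", "W")]
DAG_EDGES = [("X", "Y"), ("X", "Z"), ("Y", "W"), ("Z", "W")]  # W has two parents
```

In the second example graph node W has two parents, so the graph is a DAG but not a tree.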

[0009] In the work of Tim Callahan & John Wawrzynek [...] "tree covering" program called lburg. [...] Retargetable C Compiler: Design and Implementation", Christopher W. Fraser and David R. Hanson, Benjamin/Cummings Publishing Co., Inc., Redwood City, 1995, especially at pp 373-407. lburg takes as input the source code trees and partitions this input into chunks that correspond to instru[...] tree cover. This approach is essentially determined by [...] complex: it involves a bottom-up matching of a tree with patterns, recording all possible matches, followed by a top-down reduction pass to determine which match of patterns provides the lowest cost. Again, this approach requires a significant initial constraint in the form [...] capabilities of a reconfigurable architecture.

[0010] There is thus a need to develop techniques and approaches [...] systems involving a primary and secondary processor, [...] secondary processor, which can then be configured as efficiently as possible to run the extracted code, with a view to maximising the performance efficiency of the primary [...]

[0011] Accordingly, the invention provides a method of compiling [...] processor, comprising: selective extraction of dataflows from [...] trees; matching of the trees against each other to determine minimum edit cost relationships for transformation of one tree into another; determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships and creating for each group a generic dataflow capable of su[...] generic dataflow or dataflows to determine the hardware [...] the source code calls to the secondary processor for said gr[...] resultant source code to the primary processor.

[0012] This approach allows for optimal selection [...] without prejudgement of suitability (by, for example, mapping onto predetermined templates) but while still taking full account of the demands and requirements of the secondary processor [...] cost relationships are determined according to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration of the second[...] edit cost relationships are embodied in a taxonomy of [...]

[0013] The method finds its most useful application where [...] allows for reconfiguration of the secondary processor during execution of the source code, as this allows for reconfiguration of the secondary processor to be required during execution of the source code to support each dataflow in the group supported by a generic dataflow. The secondary processor may thus be an application specific instruction processor, and the processor hardware may be a field programmable gate array or a field programmable arithmetic array (such as that shown in the CHESS architecture discussed in Appendix [...]

[0014] Advantageously, the generic dataflow of a group is calculated by an approximate mapping of dataflows in the group on to each other, followed by a merge operation.

[0015] An advantageous approach to construction of a generic dataflow is [...] acyclical graphs and reduce them to trees by removal of any links in the directed acyclical [...] path between a leaf node and the root of a directed acyclical graph, [...] which passes through the largest number of intermediate nodes. Alternative criteria to the critical path can be adopted if more appropriate to the secondary processor hardware (for example, [...] sensitive to the timing of operations in the secondary processor).

[0016] An advantageous further step can be taken after the creation of a generic dataflow, in which the generic dataflow is compared with further dataflows extracted from the source code, wherein [...] match sufficiently closely the generic dataflow are added to the generic dataflow. This enables more or all of the code present in the source code which is suitable for allocation to the secondary processor to [...]

[0017] In the approaches indicated above, the removed links are stored after the directed [...] reduced to trees and are reinserted into the generic dataflow [...] dataflow.

[0018] Specific embodiments of the invention are described below, [...] accompanying drawings, of which:

Figure 1 shows a general purpose computer architecture to which embodiments of the invention can suitably be applied;
Figure 2 shows schematically a method of compiling source code to a primary [...] according to an embodiment of the invention;
Figure 3 illustrates a step of conversion of a DAG to a tree employed in a method [...] embodiment of the invention;
Figure 4a illustrates the step of insertion and deletion of nodes and Figure 4b illustrates the step of substitution of nodes in a tree matching process employed in a method step according to an embodiment of the invention;
Figure 5 shows an edit distance taxonomy provided in an example according to an embodiment of the invention;
Figure 6 illustrates a generic dataflow provided in an example according to one embodiment of the invention;
Figure 7 shows a logical interface for allocation of [...] to an embodiment of the invention;
Figure 8 shows the application of DAGs to dataflows including multiplexers to handle conditional statements; and
Figures 9a to 9d show an illustration of the merging of candidate dataflows to form a generic dataflow in an example according to an embodiment of the invention.

[0019] The present invention is adapted for compilation of source code to an architecture comprising a primary and a secondary processor. An example of such an architecture is shown in Figure 1. The primary processor 1 is a conventional general-purpose processor, such as a Pentium II processor of a personal computer. Receiving calls from the primary processor 1 and returning responses to it are secondary processors 2 and (optionally) 4. Each secondary processor 2, 4 is adapted to increase the computational power and efficiency of the architecture by handling parts of the source code not well handled by the primary processor 1. Secondary processor 4, optionally present here, is a dedicated coprocessor adapted to handle a specific function (such as JPEG, DSP or the like) - the structure of this coprocessor 4 will be determined by a manufacturer to handle a specific frequently used function. Such coprocessors 4 are not the specific subject of the present application. By contrast, the secondary processor 2 is not already optimised for a specific function, but is instead configurable to enable improved handling of parts of the source code not well handled by the primary processor. The secondary processor 2 is advantageously an application specific structure: it can be a conventional FPGA, such as the Xilinx 4013 or any other member of the Xilinx 4000 series. An alternative class of reconfigurable device, referred to as a field programmable arithmetic array, is described in Appendix A hereto. Such a secondary processor can be configured for high computational efficiency in handling desired parts of the source code for an application to be executed by the architecture.

[0020] Also employed in the computer architecture are memory 3, accessed by the primary processor 1 and, for appropriate types of secondary processor 2, by the secondary processor 2, and input/output channel 5. Input/output channel 5 here represents all further channels and hardware necessary to enable the user to interact with the processors (for example, by programming) and to allow the processors to interact with all other parts of the computer device 6.

[0021] The present invention is particularly relevant to the optimised partitioning of source code between primary processor 1 and secondary processor 2, which allows for optimal configuration of secondary processor 2 to optimise the handling of the application embodied in the source code by the architecture. A significant contribution is made by the invention in the selection and extraction of code for use in the secondary proc[...]

[0022] The approach taken, according to an embodiment of the invention, is set out in Figure 2. The initial input to the process is a body of source code. In principle, this can be in any language: the example described was carried out on C code, but the person skilled in the art will readily understand how the techniques described could be adopted with other languages. For example, the source code could be Java byte code: if Java byte code could be so handled, the architecture of Figure 1 could be particularly well adapted to directly receiving and executing source code [...] the internet.

[0023] As can be seen from Figure 2, the first step in the process is the identification of appropriate candidate code to be executed by the secondary processor 2. Typically, this is done by performing dataflow analysis on the source code and building appropriate representations of the dataflows presented by selected lines of code (in most processes, this is normally preceded by a manual profiling of the code). This is a standard technique in compiling generally, and application to secondary processors is discussed in, for example, Athanas et al, "An Adaptive Hardware Machine Architecture and Compiler for Dynamic Processor Reconfiguration", IEEE International Conference on Computer Design, 1991, pages 397-400.

[0024] The approach taken here is to build directed acyclical graphs (DAGs) which represent the dataflows of selected code. An advantageous way to do this is by using a compiler infrastructure appropriately configured for the extraction of dataflows: an appropriate compiler infrastructure is SUIF, developed by the University of Stanford and documented extensively at the World Wide Web site http://suif.stanford.edu/ and elsewhere. SUIF is devised for compiler research for high-performance systems, specifically including systems comprising more than one processor. A standard SUIF utility can be used to convert C code to SUIF. It is then a simple process for one skilled in the art to use SUIF tools to build DAGs by performing a dataflow analysis over sections of SUIF and then recording the results of the analysis.
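As a purely illustrative stand-in for the SUIF-based extraction described above, the following Python sketch builds a dataflow DAG from a few straight-line assignments using the standard `ast` module. It does not use SUIF, and the function name, node numbering and example statements are assumptions made only for this sketch. Edges point from an operand towards the operation that consumes it.

```python
import ast

def build_dataflow_dag(source):
    """Return (nodes, edges) for straight-line assignments.
    nodes: {node_id: label}; edges: (operand_id, consumer_id) pairs."""
    nodes, edges = {}, []
    current = {}   # variable name -> id of the node holding its latest value

    def new_node(label):
        node_id = len(nodes)
        nodes[node_id] = label
        return node_id

    def visit(expr):
        if isinstance(expr, ast.Name):
            if expr.id not in current:           # first use of an input variable
                current[expr.id] = new_node(expr.id)
            return current[expr.id]
        if isinstance(expr, ast.Constant):
            return new_node(repr(expr.value))
        if isinstance(expr, ast.BinOp):
            left, right = visit(expr.left), visit(expr.right)
            op = new_node(type(expr.op).__name__)    # e.g. "Add", "Mult"
            edges.extend([(left, op), (right, op)])
            return op
        raise ValueError("unsupported expression")

    for stmt in ast.parse(source).body:
        if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            raise ValueError("only simple assignments are handled")
        current[stmt.targets[0].id] = visit(stmt.value)
    return nodes, edges

# Hypothetical three-line extract: the value of t is consumed twice.
EXAMPLE = "t = a + b\nu = t * 2\ne = u - t"
EXAMPLE_NODES, EXAMPLE_EDGES = build_dataflow_dag(EXAMPLE)
```

Because the intermediate value `t` feeds two later operations, the resulting graph has a node with two outgoing edges and is therefore a DAG rather than a tree.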
[0025] The extraction of DAGs from source code is a conventional step. The next step in the process, as can be seen from Figure 2, is the conversion of these DAGs into trees. This step is a significant factor in making the optimal choice of code for execution by the secondary processor 2. DAGs are complex structures, and difficult to analyse in an effective manner. Reduction of DAGs to trees allows the aspects of the dataflows most important in determining their mapping to hardware to be retained, while simplifying the structure sufficiently to allow analytical approaches to be made significantly more effective.

[0026] Discussion of the reduction of DAGs to trees is made in "High Performance Compilers for Parallel Computing" (as cited above), especially at pages 56 to 60. Different terminology is used here from that used in the cited reference, but equivalent and comparable terms are indicated below. The type of trees constructed here are directly comparable to the "spanning trees" referred to in the cited reference [...]

[0027] The preferred approach followed in the reduction of DAGs to trees [...] between leaf nodes and the root: this is illustrated in Figure 3. The critical path between nodes A and B is in a first embodiment of this reduction process defined as the one that touches the maximum number of nodes. As a DAG is, by definition, acyclic, distinct paths can be defined to meet this criterion. It is possible for there to be different paths between nodes that have the same maximum number of nodes, but these paths are likely all to be satisfactory for the purpose of tree construction. While making an arbitrary selection between these paths is a valid approach, a key issue in mapping the source code successfully is scheduling, which depends on timing information: accordingly, where it is necessary to make a choice between alternative "critical paths" it is desirable to choose the one that would take the longest time (in terms of time taken to execute each of the operations represented by the nodes in the path). As is discussed further below, alternative approaches can be adopted which are based more directly on timing information. It is also desirable to adopt a consistent approach in making such choices - otherwise morphologically different trees can result from essentially similar DAGs.

[0028] The process taken in applying this first embodiment of the c[...] leaf node, every possible path towards the root is chased: as the DAG is a directed graph, this is straightforward. As indicated above, for each leaf node the path with the greatest number of nodes is chosen, and if two paths are found to have the same number of nodes, a selection is made. This is the critical path for that leaf node. All other paths not selected are cut in their edge closest to the starting point. This cut edge is termed a minor link (equivalent to the term "cross-link" in the Wolfe reference). The tree consists of the assembly of critical paths, and contains no minor links. The minor links are stored separately. Minor links will be required when extracted source code is mapped to secondary processor 2, but are not used in determining which source code is to be m[...]

[0029] It is of course possible to construct trees from DAGs without using the critical path criterion. Use of the critical path does provide particular advantages. In particular, removal as minor links of the cross-links not on the critical path will have little effect on scheduling, whereas if another approach [...] considerable influence on timing and hence on scheduling. Use of the critical path criterion allows construction of a tree which represents as best possible the critical features of the DAG in the [...]

[0030] Figure 3 shows the application of the process described in the preceding paragraph. Source code extract 11 shows three lines under consideration for execution by secondary processor 2. DAG 12 shows these three lines of code represented as a directed acyclical graph, with root 126 (variable e) and leaf nodes 121, 129 and 130 as the inputs.

[0031] It is now a straightforward matter to assess each path from a given leaf node to the root, and to compare the number of nodes in each path. From node 129 (integer value 2), there is only one path, through nodes 122, 123, 124 and 125: this is then the critical path from leaf node 129 to root node 126, and will be present in the tree. From node 121 (in the present case the result of an earlier operation and designated c), there are two paths. The first path passes through nodes 122, 123, 124 and 125, whereas the second path passes through nodes 127, 128 and 125. The first path is the critical path, as it passes through more nodes: the second path can thus be cut as is discussed below. The remaining leaf node 130 (variable b) also has two paths available: one passes through nodes 123, 124 and 125, whereas the other passes through nodes 127, 128 and 125. These are equivalent in terms of number of nodes and so either path can be chosen as the critical path: however, for reasons discussed above (timing and morphological consistency) it is desirable to operate under an appropriate set of further rules to make the best selection. Such further rules may, for example, be determined on the basis of the relevant hardware. Here, the second path is chosen.

[0032] The next step to take is to construct a tree 14 from the critical paths chosen from the DAG 12. This is done by cutting all non-critical paths in their edge closest to the starting point (that is, the edge closest to the starting point which is not also part of a critical path). The first non-critical path to consider is that from node 121 to root 126 through nodes 127, 128 and 125: this can be cut on the edge between nodes 121 and 127 - in the tree, this is represented by removal of edge 151 between nodes 141 (corresponding to 121) and 147 (corresponding to 127), which is stored separately as a minor link. The other non-critical path to consider is that from node 130 to root 126 through nodes 123, 124 and 125: this can be cut on the edge between nodes 130 and 123. Again, this cut edge is stored as a minor link.
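The reduction just described can be sketched in Python as follows. This is a minimal illustration and not a definitive implementation: the edge list is a reading of the paths enumerated for DAG 12 in the text (Figure 3 itself is not reproduced here), the function names are hypothetical, and the tie-break rule (prefer the lexicographically larger sequence of node numbers) is an arbitrary assumption that happens to select the second path for leaf node 130. The sketch has been checked only against this example.

```python
def paths_to_root(succ, node, root):
    """Enumerate every directed path from node to root (edges point towards the root)."""
    if node == root:
        return [[root]]
    return [[node] + rest
            for nxt in succ.get(node, [])
            for rest in paths_to_root(succ, nxt, root)]

def reduce_to_tree(edges, leaves, root, tie_break=lambda path: path):
    """Keep one critical path per leaf and cut the others; return (tree_edges, minor_links)."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)

    critical, rejected = [], []
    for leaf in leaves:
        candidates = paths_to_root(succ, leaf, root)
        # Critical path: greatest number of nodes, ties settled by tie_break.
        best = max(candidates, key=lambda p: (len(p), tie_break(p)))
        critical.append(best)
        rejected.extend(p for p in candidates if p is not best)

    critical_edges = {e for p in critical for e in zip(p, p[1:])}
    minor_links = set()
    for path in rejected:
        # Cut at the edge closest to the starting point that is not on a critical path.
        for edge in zip(path, path[1:]):
            if edge not in critical_edges:
                minor_links.add(edge)
                break
    tree_edges = [e for e in edges if e not in minor_links]
    return tree_edges, minor_links

# Edges of DAG 12 as inferred from the paths listed in the text (an assumption).
FIG3_EDGES = [(129, 122), (121, 122), (122, 123), (130, 123), (123, 124),
              (124, 125), (121, 127), (130, 127), (127, 128), (128, 125),
              (125, 126)]
TREE_EDGES_14, MINOR_LINKS = reduce_to_tree(FIG3_EDGES, [129, 121, 130], 126)
```

With this input the two minor links are the edges from node 121 to node 127 and from node 130 to node 123, in agreement with the cuts described above, and every remaining node has a single edge towards the root.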
[0033] It should be noted that conditionals can be represented in DAGs and so reduced to trees in exactly the same way as simple equations. An example is shown in Figure 8: this is a DAG representing the dataflow of the lines,

if (x<2)

a=b

else

[...]

and shows a multiplexer node 186 and a "less than" operation node 186 in addition to the variable and integer nodes 181, 182, 183 and 184. As the skilled man will appreciate, it will [...] for source code which can be represented as a DAG.

[0034] The tree structure that is left - in this case, tree 14 - is [...] source code should be mapped to secondary processor 2, as is discussed further below. The technique described above is a particularly appropriate one for converting DAGs to trees, as it is straightforward to implement, is general in application, and through use of the critical path mai[...] synthesised (assuming each node represents a single computational element) because of the inclusion of paths with the maximum number of nodes. As the person skilled in the art will appreciate, alternative approaches to determining which edges are to be removed in converting the DAGs into trees can be adopted. One alternative embodiment of the DAG to tree reduction process is to assign a timing-based weight to ev[...] required to execute the corresponding computational element) and then to compare the accumulated weights of each path, selecting a path to define the tree accordingly on the basis of, for example, greatest accumulated weight. This approach may be more appropriate [...] the timing parameters [...] and in particular if the timing dependencies are not mainly related to the node count (which may be the case in structures where, for example, multiplication is several times more time consuming than addition).
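The difference between the node-count criterion and a timing-weighted criterion can be illustrated with the following short Python sketch. The two candidate paths and the weights are hypothetical values chosen only to show the two criteria disagreeing; real weights would be derived from the secondary processor hardware.

```python
def longest_by_node_count(paths):
    """First criterion: the path touching the greatest number of nodes."""
    return max(paths, key=len)

def longest_by_weight(paths, weight):
    """Alternative criterion: the path with the greatest accumulated weight."""
    return max(paths, key=lambda p: sum(weight[n] for n in p))

# Two hypothetical paths from the same leaf "x" to the same root "out".
CANDIDATE_PATHS = [
    ["x", "add1", "add2", "add3", "out"],   # many cheap operations
    ["x", "mul", "out"],                    # one expensive operation
]
# Hypothetical timing weights: a multiplication takes four times as long as an addition.
WEIGHT = {"x": 0, "out": 0, "add1": 1, "add2": 1, "add3": 1, "mul": 4}

BY_COUNT = longest_by_node_count(CANDIDATE_PATHS)
BY_WEIGHT = longest_by_weight(CANDIDATE_PATHS, WEIGHT)
```

Here the node-count criterion selects the chain of additions, whereas the weighted criterion selects the path through the multiplication.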
[0035] The next step in the compilation process, as can be seen from Figure 2, takes trees as inputs and determines the selection of source code for the secondary processor 2. As is further illustrated in Figure 2, this step of the process comprises a series of sub-steps. The first of these is the analysis and classification of the trees resulting from the candidate dataflows. This is a significant original step, and is discussed in d[...]

[0036] The objective in this stage of the compilation proc[...] dataflows from the source code would be the best choices for execution by the secondary processor. This is to a large degree dependent on the nature of the hardware in the secondary processor. An extremely efficient mapping of source code to the secondary processor 2 can be made where dataflows are sufficiently similar that broadly the same hardware representation can be used for each dataflow. It therefore follows that good choices of candidate dataflows for mapping to the secondary processor can be made by finding sets of dataflows that are sufficiently similar to each other. This is what is achieved by analysing and classifying the trees resulting from the candidate dataflows.

[0037] A powerful technique for matching trees, used in this embodiment of the invention, is the tree matching algorithm devised by Kaizhong Zhang of the University of West Ontario [...]

[0038] This algorithm is described in Kaizhong Zhang, "A Constrained Edit Distance Between Unordered Labelled Trees", Algorithmica (1996) 15:205-222, Springer Verlag, and is provided as a toolkit by the University of West Ontario, the toolkit being at the time of writing obtainable over the Internet from ftp://ftp.csd.uwo.ca/pub/kzhang/TREEtool.tar.gz. It will be appreciated that alternative approaches of matching trees to determine a degree of similarity therebetween are available to the skilled man. The approach to tree matching used in this embodiment of the invention is described below.

[0039] The principle of operation of Zhang's algorithm is the following: two trees are compared node-by-node through a dynamic programming technique that minimises the edit operations required to transform one tree into another. This cost of transformation is termed here an edit cost. The edit costs of successively larger subtrees are cross-compared, with a record being kept of the minimum costs found. The computational structure can be characterised as that of a recursive dynamic program which uses a working dynamic programming grid to calculate component subtree distances and records the result on the main grid.

[0040] The edit operations available are insertion, deletion and substitution. These are shown in Figures 4a and 4b. Figure 4a shows two trees: tree 151 with five nodes and tree 152 with six nodes. The structure of the trees can be made identical by addition of a node between nodes 3 and 5 of tree 151: this new node gives the structure of tree 152. Consequently transformation of tree 151 to tree 152 is achieved by insertion of this node, and transformation of tree 152 to tree 151 is achieved by deletion of it (in the CHESS architecture described in Appendix A, "deletion" is represented in hardware by "bypass" of a unit of the array: this is an example of an architecturally designed cost - in this case, an extremely low cost). For Figure 4b, the two trees 151 and 152 have the same structure, but the two nodes 4 represent a different type of operation in each tree: it is therefore necessary to substitute for node 4 in transforming one tree to the other. Every node therefore needs a "label": a tag attached to the node which identifies the type of node among the [...]

[0041] As previously indicated, each of these edit operatio[...] for example, the same result may be achieved in some architectures either by an insertion and a deletion, or by a substitution: the costs of these different alternatives can become [...]

[0042] The result of the comparison of two trees by this algorithm is the production of a list of pairs of nodes (t1,t2), where t1 belongs to the first tree and t2 belongs to the s[...] points in the two trees, suggesting the mapping of t1 and t2 on to each other. The list of pairs effectively defines the skeleton of a tree which can contain either of the compared trees: in this skeleton, to transform the first tree into the second tree, each node t1 has to be substituted with the respective t2. Nodes that do not occur in the mapping must be either inserted or deleted depending on which t[...] For this list of pairs there will be defined an edit distance: this is the minimum of the edit costs cumulated over the pairs necessary to transform one tree to the other. The algorithm is devised to deter[...] transformations which achieves that edit distance: alternative transformations will be possible, but they will have a higher associated cumulative edit cost.

[0043] The value of computing an edit distance based on edit costs is that the edit costs may be chosen to represent the "hardware cost" in reconfiguring the secondary processor from the configuration representing one tree to a configuration representing the other tree in a mapping. This "hardware cost" is typically a measure of the quantity of secondary processor resources that will be taken up to achieve the se[...] can be considered, for example, in terms of the additional area of device used. These costs will be determined by the nature of the secondary processor hardware, as for different types of hardware the physical realisation of insertion, deletion and substitution operations will be different. For the reconfigurable CHESS array discussed in Appendix A, a "bypass" operation involves minimal cost, a substitution between adds and subs (addition and subtraction operations) has low cost, whereas substitution between muls and divs (multiplication and division operations) is expensive.
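To make the notion of architecture-dependent edit costs concrete, the Python sketch below accumulates the cost of one given mapping between two labelled trees. It is not Zhang's algorithm: it does not search for the minimising mapping, it only prices a mapping that is supplied. All numerical costs, the cost-table layout and the function names are hypothetical; only the ordering (bypass cheapest, add/sub substitution cheap, mul/div substitution expensive) follows the text.

```python
# Hypothetical cost table; real values would be derived from the target hardware.
BYPASS_COST = 1                      # deletion realised as "bypass"
INSERT_COST = 10
SUBSTITUTION_COST = {
    frozenset(["add", "sub"]): 2,    # low cost
    frozenset(["mul", "div"]): 50,   # expensive
}
DEFAULT_SUBSTITUTION_COST = 20

def substitution_cost(a, b):
    """Cost of replacing an operation labelled a by one labelled b."""
    if a == b:
        return 0
    return SUBSTITUTION_COST.get(frozenset([a, b]), DEFAULT_SUBSTITUTION_COST)

def mapping_cost(labels1, labels2, pairs):
    """Cumulative edit cost of transforming tree 1 into tree 2 under a given mapping.
    labels1, labels2: {node_id: operation label}; pairs: list of (t1, t2)."""
    mapped1 = {t1 for t1, _ in pairs}
    mapped2 = {t2 for _, t2 in pairs}
    cost = sum(substitution_cost(labels1[t1], labels2[t2]) for t1, t2 in pairs)
    cost += BYPASS_COST * len(set(labels1) - mapped1)   # nodes only in tree 1: deleted
    cost += INSERT_COST * len(set(labels2) - mapped2)   # nodes only in tree 2: inserted
    return cost

# Hypothetical five-node and six-node trees differing by one label and one extra node.
TREE_A = {1: "add", 2: "mul", 3: "add", 4: "sub", 5: "add"}
TREE_B = {1: "add", 2: "mul", 3: "add", 4: "add", 5: "add", 6: "sub"}
PAIRS = [(n, n) for n in range(1, 6)]

COST_A_TO_B = mapping_cost(TREE_A, TREE_B, PAIRS)
COST_B_TO_A = mapping_cost(TREE_B, TREE_A, [(b, a) for a, b in PAIRS])
```

With these assumed values the cost is asymmetric: going from the smaller tree to the larger requires an insertion, while the reverse direction requires only a cheap bypass.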

[0044] As indicated above, an edit distance between two trees can be constructed. However, a further step can be taken: using Zhang's algorithm, or a comparable approach, a taxonomy can be built to show the edit distances between each one of a set of trees. This taxonomy can readily be provided in the form of a tree, of which an example is shown in Figure 5. Each leaf node 161 of the tree represents a candidate tree extracted from a DAG, and each intermediate node 162 represents an edit cost. The tree provides a unique path between each pair of leaf nodes. The edit distance between the two leaf nodes of a pair is found by summation of costs provid[...] example, the edit distance between any pair of the leaf nodes representing Tree#4, Tree#5 or Tree#6 is 6. However, the edit distance between Tree#1 and Tree#[...] is 496: the summation of intermediate nodes with values of 12, 221, 107, 50 and 6.

[0045] This taxonomy is indicative of the number of edit operations required to translate between trees. Such a
taxonomy is a valuable tool, as it can be used heuristically as a metric for the deg[...]
The creation of a taxonomy thus renders it easy to determine which trees are sufficiently similar to be consolidated
together (as will be discussed below), and which are too diverse for this purpose. This can be done by imposition of an
edit distance threshold. A group of trees can be selected for consolidation if the edit distance between each and every
possible pair of trees in the group is less than the edit distance threshold. The value of the edit distance threshold is
arbitrary, and can be chosen by the person skilled in the art in the context of specific primary and secondary processors
in order to optimise the performance of the system.
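
The pairwise threshold test can be sketched as below. The test itself follows the text; the greedy way of forming
groups in greedy_classes is an assumption, since the description does not prescribe how candidate groups are
enumerated, and all names are hypothetical.

```python
from itertools import combinations


def can_consolidate(group, distance, threshold):
    """True if every pair of trees in the group is closer than the threshold."""
    return all(distance(a, b) < threshold for a, b in combinations(group, 2))


def greedy_classes(trees, distance, threshold):
    """Greedily partition trees into groups that each satisfy can_consolidate."""
    classes = []
    for tree in trees:
        for cls in classes:
            if can_consolidate(cls + [tree], distance, threshold):
                cls.append(tree)
                break
        else:
            classes.append([tree])  # no existing group accepts this tree
    return classes
```

Here distance is any callable returning the edit distance between two trees, for example one derived from the
taxonomy.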

[0046] The advantage of consolidating a group of trees is that a common hardware configuration can be used for the
whole group and will support the function of each tree. This is particularly appropriate for architectures, such as
CHESS, in which low-latency partial reconfiguration mechanisms are available on the secondary processor.
Reconfiguration is required to change the configuration from that to support the function of one tree to that
of another tree: however, as the edit distance between these trees will never be greater than the edit cost threshold, the
degree of reconfiguration required is already known to be within acceptable bounds. The group of trees are
consolidated together by construction of a "supertree" which contains a representation of every component tree. After it
has been constructed, the supertree can be converted into a representation of each of the relevant DAGs extracted from
the source code by reinsertion of the previously removed minor links. The hardware configuration may then be determined
from the full supertree. The construction of the supertree is discussed in detail below.

[0047] Figure 6 illustrates the step of construction of a supertree from a group of trees which fall below the specified
edit cost threshold: such a group of trees is here termed a class. The trees 171, 172 and 173 can all be mapped
together into supertree 170. The reconfiguration required to change the hardware configuration from that to support, for
example, tree 171 to that of tree 172 is sufficiently limited to be realizable in practice, because the edit distance between
the two trees is below the edit cost threshold.

[0048] An exemplary supertree assembly algorithm, merge, is provided as C code in Appendix B. The function of the
algorithm is described below, with reference to Figure 9. The algorithm contains the following elements:

[0049] The tree in the class with the largest number of nodes is chosen to be the initial merge tree - if there are trees
with an equal number of nodes, an arbitrary selection can be made. The remaining trees are termed source trees.
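
The choice of initial merge tree can be sketched as follows. This is not the C code of Appendix B; the tuple
representation (operation, [child trees]) and the names count_nodes and choose_merge_tree are assumptions made for
the example.

```python
def count_nodes(tree):
    """Count the nodes of a tree given as (operation, [child trees])."""
    _op, children = tree
    return 1 + sum(count_nodes(child) for child in children)


def choose_merge_tree(trees):
    """Split a class into the initial merge tree and the remaining source trees."""
    merge_tree = max(trees, key=count_nodes)  # ties resolved by first occurrence
    source_trees = [t for t in trees if t is not merge_tree]
    return merge_tree, source_trees
```

Resolving ties by first occurrence is one way of making the arbitrary selection mentioned above.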
[0050] For each source tree the following operations [...]

From the mapping between the source tree and the merge tree which has been calculated (in this embodiment,
from Zhang's algorithm and edit costs determined from the secondary processor architecture), the supertree is
constructed as follows:

1. Firstly, mapped nodes closest to the root [...]

2. The source tree operation (source operation) is concatenated to the corresponding mapped merge tree
operation (merge operation);

3. For each child operation of the source operation

   a. If the child is mapped, revert to step 2 with respect to the source [...]

   b. If the child is not mapped, then consider w[...]
   is the root (source subtree):

      i. If there is no further mapping, simply adopt the source subtree for merging into the merge tree under
      the corresponding merge tree node.

      ii. If there is a further mapping inside the source subtree, connect the subtree as follows:

         a. If the merge operation of this subordinate mapping [...]
         remove the mapped source operation from the source tree. There is recursion present at this
         stage - where mapped children have already been dealt with, all that needs to be done is to
         remove what would otherwise be a cross tree link.

         b. This is shown in Figure 9. If the merge operation of this subordinate mapping does fall within
         the previously mapped subtree, climb up the merge tree until the least common ancestor for all
         contained subordinate mappings is found. The least common [...]
         all of the source mappings. The unmapped source segment is then mapped into the merge tree
         by linking the source operation of the unmapped source subtree as a child of the least common
         ancestor's parent and by linking the least common ancestor as the child of the un[...]
         operation just above the closest mapped source operation in the curre[...]
         [...]est mapped source operation" delimits the lower end of an unm[...]
         and is a mapped node which falls within the subtree of the current mapping - the source node's
         parent, which is unmapped, adopts the merge tree's least com[...]
         versa).

The pair of intermingled trees are normalised into a single tree [...]

The procedure continues until all the source trees in the class are contained within the merge tree, which is now a
supertree.
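
The simpler branches of the procedure can be sketched as below. This is not the merge function of Appendix B: it
covers only step 2 (concatenation), step 3(a) (recursion on mapped children) and step 3(b)(i) (adoption of an
unmapped subtree with no further mapping), and it deliberately refuses the least common ancestor case of step
3(b)(ii). The Node class, the mapping dictionary and the function names are assumptions made for the example.

```python
class Node:
    """A tree node holding the list of operations concatenated onto it."""

    def __init__(self, op, children=()):
        self.ops = [op]  # becomes e.g. ["A", "A"] after concatenation
        self.children = list(children)


def has_mapping(node, mapping):
    """True if the node or any of its descendants appears in the mapping."""
    return node in mapping or any(has_mapping(c, mapping) for c in node.children)


def merge_simple(source, mapping):
    """Merge a mapped source node into its merge tree node (simple cases only)."""
    target = mapping[source]
    target.ops.extend(source.ops)  # step 2: concatenate the operations
    for child in source.children:
        if child in mapping:  # step 3(a): mapped child, repeat step 2
            merge_simple(child, mapping)
        elif not has_mapping(child, mapping):  # step 3(b)(i): adopt the subtree
            target.children.append(child)
        else:  # step 3(b)(ii) is outside the scope of this sketch
            raise NotImplementedError("subordinate mapping inside unmapped subtree")
```

The mapping is keyed on node identity, so it would be built from the node objects of the source tree and the merge
tree after the tree comparison has been run.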

[0051] This process is indicated in Figure 9. Figure 9a shows two dataflow trees, a merge tree 201 and a source tree
202. There are three mappings made between nodes by the comparison algorithm - the remaining nodes need
to be inserted appropriately. As indicated in section 1 above, the first step is to consider the mapped operations nearest
the root - in this case, at the root. These operations A are concatenated.

[0052] After this, the child nodes of A in the source tree are considered. Node B does not have a mapping and is not
an ancestor to any mappings - it is therefore merged as a child of A:A (see Figure 9b). The other child node of A, C,
does however have descendant mappings (D and F which map to D and E in the merge tree). Both the relevant merge
operations fall in the previously mapped subtree (as they are both descendants of A). It is therefore necessary to follow
the course set out in section 3(b)(ii)(b) above. The least common ancestor containing both mapped merge operations
D and E is X. C of the source tree is thus linked into the merge tree as child of A:A (the parent of X) and parent of X.
This arrangement is shown in Figure 9b - the merging is completed by concatenation or merging of the remaining nodes
of the source tree, all of which steps are straightforward.

[0053] The resultant supertree 203 is shown in Figure 9c. This supertree 203 acts as merge tree for the merging in
of a further candidate source tree 204, as shown in Figure 9d. In this case each node of the source tree is mapped into
a supertree node - merging is thus entirely straightforward, and consists only of concatenation (ie substitution). This





process continues until all the [...]
[0054] At this stage, it is possi[...]
[...]ary processor. The source code will contain DAGs other than those which have been selected for inclusion of the
supertree: for example, DAGs which have not been considered because they do not lie at one of the most computationally
intensive "hot spots" of the code. However, the code of [...]
[...]ately adapted secondary proce[...] rather than c[...]
remaining DAGs with the supertree [...] a backmapping process. Processes derived from conventional backmapping
techniques, such as Iburg, can be utilised for this purpose.

[...] to use of Zhang's algorithm, and match further candidate trees in the source code against the supertree, but this
time with a lower edit cost threshold. Where the trees derived from such DAGs can either be mapped directly onto the
supertree, or where the edit cost for such a mapping falls below some minimum level, then the code of these DAGs can also
be allocated to the secondary processor and the supertree modified, if necessary. Control information related to any
such dataflows added by this backmapping process needs to b[...]

[0055] From this supertree, it is then straightforw[...]
is their conversion into trees (including here any DAGs added from the backmapping process, if employed). The resulting
structure is a class dataflow, which represents all the information present in the DAGs of the class: control information
for the supertree (for example, to determ[...]
flow can be used for the purpose of determining the hardware configuration of the secondary processor, and can also
be used to provide a structure for enabling stitching back into the source code appropriate calls to the secondary
processor: these steps are described further below.

[0056] Stitching calls to the secondary processor back into the source code in fact requires only the supertree, and
not the class dataflow, as the supertree preserves the periphery of the dataflow. The actions required with respect to
any replaced dataflow in the source code are replacement of inputs of the dataflow (leaves of the tree reduced from that
dataflow) with load primitives and of the output of the dataflow (root of the relevant tree) with a read. The leaves and
roots of the relevant tree are contained in the supertree, so only the supertree is required for the purpose. All remaining
code subsumed in the dataflow can simply be removed, as it is repla[...]

[0057] Figure 7 shows a logical interface for achieving the necessary substitutions into the source code. An input tree,
labelled Input Tree #3, is shown, together with a supertree, labelled PFU Tree. Each node in Input Tree #3 has its own
unique operation ID obtained from the compiler internal form representation. For the supertree (PFU Tree), registers or
other I/O resources are allocated to the leaves and the root. The implicit mapping between Input Tree #3 and PFU Tree
thus provides a correspondence between operation IDs of the Input Tree [...]
Tree in the form of a specification. The application of this specification [...]
removal of the code subsumed by the PFU and the substitution of the necess[...]
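
The substitution of loads for the leaves and a read for the root can be sketched as a small code generator. The
primitive names pfu_load and pfu_read, the register names and the function stitch_calls are hypothetical; the
specification does not name the load and read primitives.

```python
def stitch_calls(leaf_registers, root_register, result_var):
    """Emit one load per dataflow input and a single read for its output.

    leaf_registers maps each source-code input variable to the (assumed)
    register allocated to the corresponding supertree leaf.
    """
    calls = [f"pfu_load({reg}, {var});" for var, reg in leaf_registers.items()]
    calls.append(f"{result_var} = pfu_read({root_register});")
    return calls
```

For a dataflow with inputs x and y and output z, stitch_calls({"x": "r0", "y": "r1"}, "r7", "z") yields two load
statements followed by one read statement, which stand in place of the removed code.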

[0058] From the class dataflow, it is possible to configure the secondary processor. This step can be conducted
according to known approaches, by reduction of the class dataflow to a netlist (with insert, delete and substitute
operations, and including in appropriate form any dynamic reconfiguration instructions), and then mapping the netlist to
the specific secondary processor hardware, taking into account requirements of reconfiguration between component
dataflows. For conventional FPGA architectures, these steps can be carried out essentially by use of appropriate known
tools. For example, in the case of a standard Xilinx FPGA such as the XC4013, then appropriate Xilinx proprietary tools
can be used. Firstly, the netlist can be rendered in Xilinx netlist form[...]
into configurable logic blocks and input/output blocks by the Xilinx Partition Place and Route program (PPR), with the
resultant being converted to a configuration bitstream by the Xilinx MakeBits program. This approach is discussed,
together with further discussion of provision of predetermined reconfiguration solutions, in "Run-Time Programming
Method for Reconfigurable Computer" by Steve Casselman, currently available on the World Wide Web at
httpy/www.reconfig.«)m/specrept/1 01596feession1/lib^ a contribution to the World Wide Web roundtable
on reconfigurable computing operated by SB Associates, Inc. of 504 Nino Avenue, Los Gatos, CA 95032, USA.
Essentially similar procedures can be followed for alternative types of configurable and reconfigurable processor, such
as the CHESS device described in Appendix A, using tools appropriate to th[...]

[0059] Once the source code is generated in executable form with appropriate calls to the secondary processor, and
once the secondary processor configuration has been determined, the source code can be loaded and executed. The
source code is executed in the primary processor with calls to coprocessors and the secondary processor: as the
secondary processor is specifically adapted to process the dataflows extracted to it, the execution speed of the code is
significantly increased. For example, a 25% improvement was found in application of the method of this embodiment of the
invention to the iDCT algorithm from the JPEG toolkit, even though this is in fact a poor problem for mapping to such a
secondary processor because of I/O constraints.

[0060] The methods here described are thus particularly effective to allow for opti[...]
in an architecture comprising a primary processor and a reconfigurable s[...]

The CHESS array is a variety of field programmable array in which the programmable
elements are not gates, as in an FPGA, but 4-bit arithmetic logic units (ALUs). The array
configuration is described in detail in European Patent Application No. 97300563.0, and the
ALU structure and provision of instruction to ALUs is discussed in a copending application
entitled "Reconfigurable Processor Devices" filed on the same date as the present [...]

The CHESS array consists of a chessboard layout with alternating squares comprising an ALU
and a switchbox structure respectively. The configuration memory for an adjacent switchbox
is held in the ALU. Individual ALUs may be used in a processing pipeline, and in a preferred
implementation, provision is made to allow dynamic provision of instructions from one ALU
to determine the function of a succeeding ALU. ALUs are 4-bit, with four identical bitslices,
with 4-bit inputs A and B taken directly from an extensive 4-bit interconnect wiring network,
and 4-bit output U provided to the wiring network through an optionally latchable output
register: 1-bit carry input and output are also provided and have their own interconnect.

Dynamic instructions are providable from the output U of one ALU to a 4-bit instruction input
I of another ALU. The carry output Cout of one ALU can also be used as Cin of another ALU
with the effect of changing the instruction of that ALU.

The CHESS ALU is adapted to support multiplexing between A and B inputs, and also
supports multiplexing between related instructions (eg OR/NOR, AND/NAND).
Reconfiguration between such instructions can be achieved through appropriate use of the
carry inputs and outputs without consumption of silicon. More complex reconfigurations (eg
AND/XOR, Add/Sub) can be achieved through using two ALUs, the first to multiplex between
the two alternative instructions and the second to execute the chosen instruction on the
operands. Multiplication will take up more than a single ALU, making reconfiguration
involving a multiplication operation more complex. It is straightforward using the multiplexer
capacity of a CHESS ALU to "bypass" an operation, with appropriate control resulting in
either performance of the operatio[...]
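
Purely as an illustration of switching between related instructions and of "bypass", a 4-bit ALU slice can be
modelled as below. The encoding (carry input 0 selects OR, 1 selects NOR) and the function names alu_or_nor and
bypass are assumptions; the description states only that the carry inputs and the A/B multiplexer can be used for
these purposes.

```python
MASK = 0xF  # CHESS ALUs operate on 4-bit words


def alu_or_nor(a: int, b: int, carry_in: int) -> int:
    """Assumed encoding: carry_in = 0 selects OR, carry_in = 1 selects NOR."""
    result = (a | b) & MASK
    return (~result & MASK) if carry_in else result


def bypass(a: int, b: int, select_b: int) -> int:
    """Pass one operand through unchanged using the A/B multiplexer."""
    return (b if select_b else a) & MASK
```

In this model the same slice computes either instruction of the pair, so switching between them changes only a
1-bit control value rather than the configuration.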

A sample set of functions obtainable from the instruction inputs is indicated in Table A1
below: a wide range of possibilities are available with appropriate logic in connection of the
instruction inputs to the ALU. The functions are described in Table [...]

Table A1: Instruction bits and corresponding functions

Table A2: Outputs for instructions



complement arithmetic is used, and the arithmetic carry is provided to be consistent with
said arithmetic. The MATCH functions are so-called because for MATCH1 the value of 1 is [...]

Claims

1. A method of compiling source code t[...] primary and a second[...]

selective extraction of dataflows from the [...]
transformation of the extracted dataflows into trees;
matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another;

determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships
and creating for each group a generic dataflow [...]
using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor;
and

substituting into the source code calls to the secondary processor for said group or plurality of groups of
dataflows, and compiling the resultant source code to the prima[...]

2. A method as claimed in claim 1, wherein said minimum edit cost relationships are embodied in a taxonomy of
minimum edit distances for classification of [...]

3. A method as claimed in claim 1 or claim 2, wherein said minimum edit cost relationships are determined according
to the architecture of the secondary processor, and represent a hardw[...] corresponding reconfiguration
of the secondary processor.

4. A method as claimed in any of claims 1 to 3, wherein the hardware configuration of the second[...]
for reconfiguration of the secondary processor dur[...]

5. A method as claimed in claim 4, wherein the secondary processor is an application specific instruction processor.

6. A method as claimed in claim 4, wherein the secondary processor is a field programmable gate array.

7. A method as claimed in claim 4, wherein t[...]

8. A method as claimed in any of claims 4 to 7, wherein reconfiguration of the secondary processor is required during
execution of the source code to support each dataflow in the group s[...]

9. A method as claimed in any preceding claim, wherein a generic dataflow of a group is calculated by an approximate
mapping of dataflows in the group on to each other, followed by a merge operation.

10. A method as claimed in any preceding claim, wherein the dataflows are provided as directed acyclical graphs and
are reduced to trees by removal of any links in the directed acyclical graphs not present in a critical path between
a leaf node and the root of a directed acyclical graph.

11. A method as claimed in claim 10, wherein the critical path is a path between two nodes which passes through the
largest number of intermediate nodes.

12. A method as claimed in claim 10, wherein the critical path is a path between two nodes with the greatest
accumulated execution time.

13. A method as claimed in any of claims 10 to 12, wherein after the creation of a generic dataflow, the generic dataflow
is compared with further dataflows extracted from the source code and provided in the manner defined in claim 10,
wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the
generic dataflow.

14. A method as claimed in any of claims 10 or claim 13 where dependent on claim 9, wherein the removed links are
stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the
merging of the trees of the group into the generic dataflow.

Fig. 8





EP 0926 594 A1 




Figure 9a: Two Dataflow Trees

Figure 9b: Merged Composition

Figure 9c: New Supertree

Figure 9d: Next Candidate Mapping